167 research outputs found

    Reconciling Graphs and Sets of Sets

    Full text link
    We explore a generalization of set reconciliation, where the goal is to reconcile sets of sets. Alice and Bob each have a parent set consisting of s child sets, each containing at most h elements from a universe of size u. They want to reconcile their sets of sets in a scenario where the total number of differences between all of their child sets (under the minimum difference matching between their child sets) is d. We give several algorithms for this problem, and discuss applications to reconciliation problems on graphs, databases, and collections of documents. We specifically focus on graph reconciliation, providing protocols based on set of sets reconciliation for random graphs from G(n,p) and for forests of rooted trees.
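    As a concrete illustration of the difference measure d defined above, the brute-force Python sketch below computes the minimum difference matching between two small parent sets. The names and inputs are illustrative, and this is not one of the paper's reconciliation protocols; at scale one would use a min-cost matching algorithm rather than enumerating permutations.

    ```python
    # Brute-force computation of d: the total symmetric difference between
    # matched child sets under a minimum difference matching. Illustrative
    # only; feasible for small parent sets.
    from itertools import permutations

    def min_difference_matching(parent_a, parent_b):
        """parent_a, parent_b: lists of child sets (assumed equal length s).
        Returns the minimum total symmetric difference over all matchings."""
        assert len(parent_a) == len(parent_b)
        best = float("inf")
        for perm in permutations(range(len(parent_b))):
            total = sum(len(a ^ parent_b[j]) for a, j in zip(parent_a, perm))
            best = min(best, total)
        return best

    if __name__ == "__main__":
        alice = [{1, 2, 3}, {4, 5}]
        bob = [{4, 5, 6}, {1, 2}]
        print(min_difference_matching(alice, bob))  # d = 2 (elements 3 and 6 differ)
    ```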

    Optimal plans for aggregation

    Get PDF
    We consider the following problem, which arises in the context of distributed Web computations. An aggregator aims to combine specific data from n sources. The aggregator contacts all sources at once. The time for each source to return its data to the aggregator is independent and identically distributed according to a known distribution. The aggregator at some point stops waiting for data and returns an answer depending only on the data received so far. If the aggregator returns the aggregated information from k of the n sources at time t, it obtains a reward R_k(t) that grows with k and decreases with t. The goal of the aggregator is to maximize its expected reward. We prove that for certain broad families of distributions and broad classes of reward functions, the optimal plan for the aggregator has a simple form and hence can be easily computed.
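    To make the trade-off concrete, the sketch below Monte Carlo estimates the expected reward of a fixed-deadline plan ("stop at time t") under assumptions of my own, not the paper's: i.i.d. Exponential(1) response times and a reward of the form R_k(t) = k - c*t. It only illustrates why the expected reward has an interior optimum in t.

    ```python
    # Monte Carlo estimate of E[R_k(t)] for a fixed-deadline plan.
    # Assumed model (not from the paper): Exponential(1) response times,
    # reward R_k(t) = k - c*t, where k is the number of sources that
    # responded by the deadline t.
    import random

    def expected_reward(n, t, c, trials=20000):
        total = 0.0
        for _ in range(trials):
            k = sum(1 for _ in range(n) if random.expovariate(1.0) <= t)
            total += k - c * t
        return total / trials

    if __name__ == "__main__":
        for t in (0.5, 1.0, 2.0, 4.0):
            print(t, round(expected_reward(n=10, t=t, c=2.0), 2))
    ```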

    Delphic Costs and Benefits in Web Search: A utilitarian and historical analysis

    Full text link
    We present a new framework to conceptualize and operationalize the total user experience of search, by studying the entirety of a search journey from a utilitarian point of view. Web search engines are widely perceived as "free". But search requires time and effort: in reality there are many intermingled non-monetary costs (e.g. time costs, cognitive costs, interactivity costs) and the benefits may be marred by various impairments, such as misunderstanding and misinformation. This characterization of costs and benefits appears to be inherent to the human search for information within the pursuit of some larger task: most of the costs and impairments can be identified in interactions with any web search engine, interactions with public libraries, and even in interactions with ancient oracles. To emphasize this innate connection, we call these costs and benefits Delphic, in contrast to explicitly financial costs and benefits. Our main thesis is that the users' satisfaction with a search engine mostly depends on their experience of Delphic costs and benefits, in other words on their utility. The consumer utility is correlated with classic measures of search engine quality, such as ranking, precision, recall, etc., but is not completely determined by them. To argue our thesis, we catalog the Delphic costs and benefits and show how the development of search engines over the last quarter century, from classic Information Retrieval roots to the integration of Large Language Models, was driven to a great extent by the quest to decrease Delphic costs and increase Delphic benefits. We hope that the Delphic costs framework will engender new ideas and new research for evaluating and improving the web experience for everyone.

    Operator Drowsiness Test

    Get PDF
    This publication details a quantifiable and objective operator drowsiness test. The test takes between 30 seconds and two (2) minutes to administer. Any smartphone that has a front-facing camera and the supporting software can run the newly developed, self-administrable test. It leverages years of sleep deprivation research that have found objective correlations between drowsiness (or alertness) and physical and behavioral parameters, such as: gaze, facial features, pupil size, blink rate, blink duration, breathing, pulse, head movements, facial skin tone, speech pattern, and vocal sound. In addition, the mass use of smartphones with rear-facing and front-facing cameras gives researchers the opportunity to deploy this new operator drowsiness test to a wide audience.

    On-line load balancing

    Get PDF
    The setup for our problem consists of n servers that must complete a set of tasks. Each task can be handled only by a subset of the servers, requires a different level of service, and once assigned cannot be reassigned. We make the natural assumption that the level of service is known at arrival time, but that the duration of service is not. The on-line load balancing problem is to assign each task to an appropriate server in such a way that the maximum load on the servers is minimized. In this paper we derive matching upper and lower bounds for the competitive ratio of the on-line greedy algorithm for this problem, namely, [(3n)^{2/3}/2](1+o(1)), and derive a lower bound, Ω(n^{1/2}), for any other deterministic or randomized on-line algorithm.
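    A minimal sketch of the greedy rule the competitive ratio above refers to, assigning each arriving task to the currently least-loaded server able to handle it. The task format and names are illustrative, and the sketch ignores task completions, since durations are unknown at assignment time.

    ```python
    # On-line greedy assignment: place each task on the least-loaded
    # eligible server. In the paper tasks also finish after an unknown
    # duration; here loads only grow, which is enough to show the rule.

    def greedy_assign(num_servers, tasks):
        """tasks: iterable of (weight, allowed_servers) pairs, in arrival order.
        Returns the final load vector after greedy assignment."""
        load = [0.0] * num_servers
        for weight, allowed in tasks:
            target = min(allowed, key=lambda s: load[s])  # least-loaded eligible server
            load[target] += weight
        return load

    if __name__ == "__main__":
        tasks = [(1.0, {0, 1}), (2.0, {1, 2}), (1.5, {0, 2}), (1.0, {2})]
        print(greedy_assign(3, tasks))  # -> [1.0, 2.0, 2.5]
    ```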

    Torts

    Get PDF

    Data-driven evaluation metrics for heterogeneous search engine result pages

    Get PDF
    Evaluation metrics for search typically assume items are homogeneous. However, in the context of web search, this assumption does not hold. Modern search engine result pages (SERPs) are composed of a variety of item types (e.g., news, web, entity, etc.), and their influence on browsing behavior is largely unknown. In this paper, we perform a large-scale empirical analysis of popular web search queries and investigate how different item types influence how people interact on SERPs. We then infer a user browsing model given people's interactions with SERP items, creating a data-driven metric based on item type. We show that the proposed metric leads to more accurate estimates of: (1) total gain, (2) total time spent, and (3) stopping depth, without requiring extensive parameter tuning or a priori relevance information. These results suggest that item heterogeneity should be accounted for when developing metrics for SERPs. While many open questions remain concerning the applicability and generalizability of data-driven metrics, they do serve as a formal mechanism to link observed user behaviors directly to how performance is measured. From this approach, we can draw new insights regarding the relationship between behavior and performance, and design data-driven metrics based on real user behavior rather than using metrics reliant on some hypothesized model of user browsing behavior.
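    As a rough illustration of a type-aware browsing model (a cascade-style simplification of my own, not necessarily the model fitted in the paper), the sketch below computes the expected total gain of a SERP when the probability of scanning past a result depends on that result's item type.

    ```python
    # Expected total gain under a top-down scan in which the continuation
    # probability depends on the item type just examined. The per-type
    # probabilities below are made-up numbers; in the paper's setting they
    # would be estimated from logged interactions.

    CONTINUE = {"web": 0.8, "news": 0.7, "entity": 0.5}  # assumed values

    def expected_total_gain(serp):
        """serp: list of (item_type, gain) pairs from top to bottom."""
        p_examine = 1.0
        total = 0.0
        for item_type, gain in serp:
            total += p_examine * gain
            p_examine *= CONTINUE[item_type]
        return total

    if __name__ == "__main__":
        serp = [("news", 0.2), ("web", 1.0), ("entity", 0.4), ("web", 0.6)]
        print(round(expected_total_gain(serp), 3))
    ```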

    Nobody cares if you liked Star Wars: KNN graph construction on the cheap

    Get PDF
    K-Nearest-Neighbors (KNN) graphs play a key role in a large range of applications. A KNN graph typically connects entities characterized by a set of features so that each entity becomes linked to its k most similar counterparts according to some similarity function. As datasets grow, KNN graphs are unfortunately becoming increasingly costly to construct, and the general approach, which consists in reducing the number of comparisons between entities, seems to have reached its full potential. In this paper we propose to overcome this limit with a simple yet powerful strategy that samples the set of features of each entity and only keeps the least popular features. We show that this strategy outperforms other more straightforward policies on a range of four representative datasets: for instance, keeping the 25 least popular items reduces computational time by up to 63%, while producing a KNN graph close to the ideal one.
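    A minimal sketch of the sampling idea, under my own simplifications: keep only each entity's least globally popular features, then build the KNN graph by brute force with Jaccard similarity over the sampled profiles. Function and parameter names are illustrative, and the brute-force comparison stands in for the paper's full pipeline.

    ```python
    # Keep each entity's `keep` least popular features, then build a KNN
    # graph over the sampled profiles with brute-force Jaccard similarity.
    from collections import Counter

    def sample_least_popular(profiles, keep):
        """profiles: dict entity -> set of features."""
        popularity = Counter(f for feats in profiles.values() for f in feats)
        return {e: set(sorted(feats, key=lambda f: popularity[f])[:keep])
                for e, feats in profiles.items()}

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    def knn_graph(profiles, k):
        graph = {}
        for e, feats in profiles.items():
            scored = [(jaccard(feats, f2), e2) for e2, f2 in profiles.items() if e2 != e]
            graph[e] = [e2 for _, e2 in sorted(scored, reverse=True)[:k]]
        return graph

    if __name__ == "__main__":
        profiles = {"u1": {"star_wars", "dune", "alien"},
                    "u2": {"star_wars", "dune", "blade_runner"},
                    "u3": {"dune", "alien", "solaris"}}
        print(knn_graph(sample_least_popular(profiles, keep=2), k=1))
    ```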

    Fair Near Neighbor Search: Independent Range Sampling in High Dimensions

    Get PDF
    Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. There are several variants of the similarity search problem, and one of the most relevant is the r-near neighbor (r-NN) problem: given a radius r > 0 and a set of points S, construct a data structure that, for any given query point q, returns a point p within distance at most r from q. In this paper, we study the r-NN problem in the light of fairness. We consider fairness in the sense of equal opportunity: all points that are within distance r from the query should have the same probability to be returned. In the low-dimensional case, this problem was first studied by Hu, Qiao, and Tao (PODS 2014). Locality sensitive hashing (LSH), the theoretically strongest approach to similarity search in high dimensions, does not provide such a fairness guarantee. To address this, we propose efficient data structures for r-NN where all points in S that are near q have the same probability to be selected and returned by the query. Specifically, we first propose a black-box approach that, given any LSH scheme, constructs a data structure for uniformly sampling points in the neighborhood of a query. Then, we develop a data structure for fair similarity search under inner product that requires nearly-linear space and exploits locality sensitive filters. The paper concludes with an experimental evaluation that highlights (un)fairness in a recommendation setting on real-world datasets and discusses the inherent unfairness introduced by solving other variants of the problem.
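    The black-box construction can be pictured with a standard rejection-sampling sketch: draw a point from the union of the query's LSH buckets with probability proportional to its multiplicity, accept it with probability inversely proportional to that multiplicity, and keep it only if it is truly within distance r. The code below is an illustrative simplification under those assumptions, not the paper's data structure; `buckets` and `dist` are assumed inputs.

    ```python
    # Uniform sampling from the set of points that are both within
    # distance r of the query and present in at least one of its LSH
    # buckets: every such point is returned with probability
    # 1/(total bucket size) per attempt, hence uniformly on success.
    import random

    def fair_r_nn_sample(query, buckets, dist, r, max_tries=1000):
        """buckets: lists of points colliding with `query`, one list per hash table."""
        sizes = [len(b) for b in buckets]
        if sum(sizes) == 0:
            return None
        for _ in range(max_tries):
            i = random.choices(range(len(buckets)), weights=sizes)[0]
            p = random.choice(buckets[i])
            multiplicity = sum(p in b for b in buckets)
            if random.random() < 1.0 / multiplicity and dist(query, p) <= r:
                return p
        return None

    if __name__ == "__main__":
        pts = [(0.0,), (0.5,), (2.0,)]
        buckets = [[pts[0], pts[1]], [pts[1], pts[2]]]
        print(fair_r_nn_sample((0.0,), buckets, lambda a, b: abs(a[0] - b[0]), r=1.0))
    ```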
